FastDocode: Finding Approximated Segments of N-Grams for Document Copy Detection - Lab Report for PAN at CLEF 2010

Authors

  • Gabriel Oberreuter
  • Gaston L'Huillier
  • Sebastián A. Ríos
  • Juan D. Velásquez
Abstract

Plagiarism has emerged as one of the main problems that the information technology revolution has brought to our society. Pattern matching algorithms and intelligent data analysis approaches could be used to identify such practices. Furthermore, a fast document copy detection algorithm could be used in large-scale applications for plagiarism detection in academia, scientific research, patents, and knowledge management, among others. Although plagiarism detection has been tackled by exhaustive comparison of source and suspicious documents, approximate algorithms could lead to interesting results. In this paper, an approach for plagiarism detection is presented. Results on a learning dataset of plagiarized documents from PAN’09, and its further evaluation in the PAN’10 plagiarism detection challenge, showed that the trade-off between speed and performance could improve on other plagiarism detection algorithms.
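
The abstract does not spell out the algorithm, but the title refers to matching segments of n-grams between documents. As a rough illustration of that general idea only (not the FastDocode method itself), the sketch below measures how many word n-grams of a suspicious text also occur in a source text. The function names `word_ngrams` and `containment`, the whitespace tokenization, and the default n = 3 are all assumptions made for this example.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in `text` (lowercased, whitespace-split).

    Tokenization and n = 3 are illustrative assumptions, not taken from the paper.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspicious: str, source: str, n: int = 3) -> float:
    """Fraction of the suspicious text's n-grams that also appear in the source.

    Returns 0.0 when the suspicious text is too short to form any n-gram.
    """
    susp = word_ngrams(suspicious, n)
    if not susp:
        return 0.0
    return len(susp & word_ngrams(source, n)) / len(susp)


if __name__ == "__main__":
    score = containment(
        "the quick brown fox jumps",
        "a quick brown fox jumps high",
    )
    print(f"containment: {score:.3f}")
```

A high containment score would only flag a document pair for closer inspection; locating the copied passages themselves needs a further alignment step that this sketch does not attempt.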

Similar resources

University of Sheffield - Lab Report for PAN at CLEF 2010

This paper describes the University of Sheffield entry for the 2nd international plagiarism detection competition (PAN 2010). Our system attempts to identify extrinsic plagiarism. A three-stage approach is used: pre-processing, candidate document selection (using word n-grams) and detailed analysis (using the Running Karp-Rabin Greedy String Tiling string matching algorithm). This approach achi...

A Textual-Based Similarity Approach for Efficient and Scalable External Plagiarism Analysis - Lab Report for PAN at CLEF 2010

In this paper we present an approach to detect external plagiarism based on textual similarity. This is an efficient and precise method that can be applied over large sets of documents. The system that we have developed contains a first phase of document selection that uses a variant of tf-idf applied over the terms that appear in the two documents of the pair being compared. After this is don...

Encoplot - Performance in the Second International Plagiarism Detection Challenge - Lab Report for PAN at CLEF 2010

Our submission this year is generated by the same method, Encoplot, that we developed for last year's competition. There is a single improvement: in addition, we compare each suspicious document with every other one and flag the passages most probably in correspondence as intrinsic plagiarism.

External Plagiarism Detection Based on Standard IR Technology and Fast Recognition of Common Subsequences - Lab Report for PAN at CLEF 2010

The plagiarism detection system described in this paper aims to bring external plagiarism detection to the desktop. The main ideas are to incorporate standard IR technologies for the candidate selection and efficient data structures for the detailed analysis between a suspicious and a candidate document. Given that the system so far has only reached prototype status, the first results l...

CoReMo System (Contextual Reference Monotony) - Lab Report for PAN at CLEF 2010

In this paper a new approach is shown for a very fast monolingual external plagiarism detection system based on an altered n-gram concept (contextual n-gram), a new high precision contextual Information Retrieval engine, and a new pruning strategy (Referential Monotony) for plagiarism detection and its limits. The assessment results can be compared with the carried out by the winner team at PAN...


Publication date: 2010